Run one deployment orchestrator per deployment per cluster #1865
Very cool. Running orchestrators to do nothing does seem like a waste, though I guess orchestrators do have ad-hoc work as new devices join the deployment. I guess throughput for highly concurrent rollouts (1000-2000 at a time) of hundreds of thousands of total devices is a bit of a question. How fast is it to have a single node doing that work? The way orchestrators work now is unusual and we definitely want all this under test regardless of strategy. I think having multiple nodes share the work of pushing out updates has some upside in terms of efficiency and reliably getting updates out but it does have drawbacks for concurrency control and monitoring the process. I guess focusing it back into one node makes it a lot easier to check the results of the fleet health as the deployment rolls out as well. I am overall in favor. I haven't worked with Horde before so don't instinctively trust it but have heard good things :D
Device connections might be flaky, but the update might still be happening in the background.
If the inflight update's firmware UUID matches the device metadata, then the update was likely a success.
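A minimal sketch of that check, assuming both the inflight update and the device metadata are available as maps with a `:firmware_uuid` key. The module name, function name, and map shapes are hypothetical and not taken from this PR:

```elixir
defmodule InflightUpdateCheckSketch do
  # Hypothetical helper: both arguments are assumed to be maps (or structs)
  # that carry a `:firmware_uuid` key.

  # Reusing `uuid` in both patterns means this clause only matches when the
  # two UUIDs are equal.
  def likely_succeeded?(%{firmware_uuid: uuid}, %{firmware_uuid: uuid})
      when is_binary(uuid),
      do: true

  # Anything else (mismatch, missing key, nil UUID) is not treated as a success.
  def likely_succeeded?(_inflight_update, _device_metadata), do: false
end
```

For example, `InflightUpdateCheckSketch.likely_succeeded?(%{firmware_uuid: "abc"}, %{firmware_uuid: "abc"})` matches the first clause, while a mismatched or missing UUID falls through to the second.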
Otherwise the delay in starting can cause it to miss some broadcasts, and make testing a bit harder.
This adds a new event that is sent to the orchestrator when the device is fully 'online' and has gone through the `after_boot` and device registration steps.
If the device's firmware matches the deployment's, there is no need to trigger an orchestrator run.
Since we now tell the deployment that a device assigned to it is online, we can place the "device finished updating" broadcast in the right place.
This stops subscribers to `deployment:#{id}` from receiving a bunch of messages that mean nothing to them.
Add a 10-second buffer between orchestrator runs. This is done by using a `send_after` timer ref, tracking whether another call has been made, and allowing the buffer to be skipped during testing.
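One way such a buffer could look is sketched below. This is an illustration, not the code in this PR: the module name, message names, the `:buffer_ms` option, and the `runs` counter (which stands in for the real orchestrator work) are all assumptions. Only the 10-second default, the `send_after` timer ref, and the "another call was made" flag come from the commit message.

```elixir
defmodule BufferedOrchestratorSketch do
  use GenServer

  # Default buffer between runs, in milliseconds.
  @default_buffer_ms 10_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Ask for a run; it may be deferred until the buffer has elapsed.
  def trigger(pid), do: GenServer.cast(pid, :trigger)

  # Number of runs performed so far (only here to make the sketch observable).
  def run_count(pid), do: GenServer.call(pid, :run_count)

  @impl true
  def init(opts) do
    state = %{
      # `buffer_ms: 0` skips the buffer entirely, e.g. in tests.
      buffer_ms: Keyword.get(opts, :buffer_ms, @default_buffer_ms),
      timer_ref: nil,
      pending: false,
      runs: 0
    }

    {:ok, state}
  end

  @impl true
  def handle_cast(:trigger, %{buffer_ms: 0} = state) do
    # Buffer disabled: run straight away.
    {:noreply, run(state)}
  end

  def handle_cast(:trigger, %{timer_ref: nil} = state) do
    # No buffer window is open: run now and open one.
    ref = Process.send_after(self(), :buffer_elapsed, state.buffer_ms)
    {:noreply, %{run(state) | timer_ref: ref}}
  end

  def handle_cast(:trigger, state) do
    # Inside the buffer window: remember that another call was made.
    {:noreply, %{state | pending: true}}
  end

  @impl true
  def handle_call(:run_count, _from, state), do: {:reply, state.runs, state}

  @impl true
  def handle_info(:buffer_elapsed, %{pending: true} = state) do
    # A call arrived during the window: run once and open a new window.
    ref = Process.send_after(self(), :buffer_elapsed, state.buffer_ms)
    {:noreply, %{run(state) | timer_ref: ref, pending: false}}
  end

  def handle_info(:buffer_elapsed, state) do
    # Nothing arrived during the window: close it.
    {:noreply, %{state | timer_ref: nil}}
  end

  # Placeholder for the real orchestrator work.
  defp run(state), do: %{state | runs: state.runs + 1}
end
```

With this shape, a burst of triggers inside one window collapses into at most one extra run when the window closes, and tests can pass `buffer_ms: 0` to get a run per trigger.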
…es-hub/nerves_hub_web into distributed-deployment-orchestrator
A device now goes from `:connecting` to `:connected`, signifying that it is ready to receive updates. Using this different status allows us to tell the orchestrator to only schedule updates for devices that have "finished" connecting.
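As a rough illustration of how the status can gate scheduling, the sketch below filters an in-memory list of devices. The map shape, the `:status` key, and the `limit` argument are assumptions made for this example; the actual orchestrator presumably selects devices differently.

```elixir
defmodule ReadyDevicesSketch do
  # Hypothetical: devices are plain maps with a `:status` key, and `limit`
  # stands in for the number of free concurrent update slots.
  def schedulable(devices, limit) when is_integer(limit) and limit >= 0 do
    devices
    # Devices that are still `:connecting` are skipped until they finish.
    |> Enum.filter(&(&1.status == :connected))
    |> Enum.take(limit)
  end
end
```

Given two devices where only one has reached `:connected`, `ReadyDevicesSketch.schedulable(devices, 10)` returns just that one.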
This provides a nice cleanup in our new orchestrator and tests. It's just a normal GenServer and doesn't know anything of `ProcessHub`, which is just a way of running it at scale.
Currently we run one `Deployments.Orchestrator` per deployment per node. It doesn't matter whether the deployment is active or not; it will still run on each node.
If we have 100 deployments and 10 nodes, then we have 100 orchestrators per node and 1000 overall.
This architecture, specifically having an orchestrator for a deployment run on every node, makes it very hard to ensure only a certain number of concurrent updates are running at the same time. And it also makes it hard to build out deployment workflows.
This PR uses `ProcessHub` to ensure one orchestrator is running per deployment across the cluster. Additionally, it only runs orchestrators that are set as active.
This ensures centralised management of a deployment, while using `ProcessHub` to redistribute orchestrators to other nodes if a node goes offline.
The current implementation allows us to switch between strategies.
If we commit to this path (after testing in the field), we can remove the device registry, simplifying device connections. It will also enable us to add support for other device transports and transport setups, like a socket proxy or MQTT.
To run this locally, open up two terminals and run:
and
Orchestrators are set to only run on web nodes, which also builds up to us supporting a 'device proxy' setup.
Still required from this PR: